sd 1
CLT-Optimal Parameter Error Bounds for Linear System Identification
There has been remarkable progress over the past decade in establishing finite-sample, non-asymptotic bounds on recovering unknown system parameters from observed system behavior. Surprisingly, however, we show that the current state-of-the-art bounds do not accurately capture the statistical complexity of system identification, even in the most fundamental setting of estimating a discrete-time linear dynamical system (LDS) via ordinary least-squares regression (OLS). Specifically, we utilize asymptotic normality to identify classes of problem instances for which current bounds overstate the squared parameter error, in both spectral and Frobenius norm, by a factor of the state-dimension of the system. Informed by this discrepancy, we then sharpen the OLS parameter error bounds via a novel second-order decomposition of the parameter error, where crucially the lower-order term is a matrix-valued martingale that we show correctly captures the CLT scaling. From our analysis we obtain finite-sample bounds for both (i) stable systems and (ii) the many-trajectories setting that match the instance-specific optimal rates up to constant factors in Frobenius norm, and polylogarithmic state-dimension factors in spectral norm.
- North America > United States > California > Los Angeles County > Los Angeles (0.28)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- (2 more...)
Symmetry Guarantees Statistic Recovery in Variational Inference
Marks, Daniel, Paccagnan, Dario, van der Wilk, Mark
Variational inference (VI) is a central tool in modern machine learning, used to approximate an intractable target density by optimising over a tractable family of distributions. As the variational family cannot typically represent the target exactly, guarantees on the quality of the resulting approximation are crucial for understanding which of its properties VI can faithfully capture. Recent work has identified instances in which symmetries of the target and the variational family enable the recovery of certain statistics, even under model misspecification. However, these guarantees are inherently problem-specific and offer little insight into the fundamental mechanism by which symmetry forces statistic recovery. In this paper, we overcome this limitation by developing a general theory of symmetry-induced statistic recovery in variational inference. First, we characterise when variational minimisers inherit the symmetries of the target and establish conditions under which these pin down identifiable statistics. Second, we unify existing results by showing that previously known statistic recovery guarantees in location-scale families arise as special cases of our theory. Third, we apply our framework to distributions on the sphere to obtain novel guarantees for directional statistics in von Mises-Fisher families. Together, these results provide a modular blueprint for deriving new recovery guarantees for VI in a broad range of symmetry settings.
- Asia > Middle East > Jordan (0.04)
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- North America > United States > California (0.04)
- (2 more...)
Minimaxity and Admissibility of Bayesian Neural Networks
Coulson, Daniel Andrew, Wells, Martin T.
Bayesian neural networks (BNNs) offer a natural probabilistic formulation for inference in deep learning models. Despite their popularity, their optimality has received limited attention through the lens of statistical decision theory. In this paper, we study decision rules induced by deep, fully connected feedforward ReLU BNNs in the normal location model under quadratic loss. We show that, for fixed prior scales, the induced Bayes decision rule is not minimax. We then propose a hyperprior on the effective output variance of the BNN prior that yields a superharmonic square-root marginal density, establishing that the resulting decision rule is simultaneously admissible and minimax. We further extend these results from the quadratic loss setting to the predictive density estimation problem with Kullback--Leibler loss. Finally, we validate our theoretical findings numerically through simulation.
- Europe > Austria > Vienna (0.14)
- Europe > United Kingdom (0.04)
- Asia > Middle East > UAE (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Homogenized Transformers
Koubbi, Hugo, Geshkovski, Borjan, Rigollet, Philippe
We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of training. Viewing depth as a time variable, the residual stream defines a discrete-time interacting particle system on the unit sphere. We prove that, under suitable joint scalings of the depth, the residual step size, and the number of heads, this dynamics admits a nontrivial homogenized limit. Depending on the scaling, the limit is either deterministic or stochastic with common noise; in the mean-field regime, the latter leads to a stochastic nonlinear Fokker--Planck equation for the conditional law of a representative token. In the Gaussian setting, the limiting drift vanishes, making the homogenized dynamics explicit enough to study representation collapse. This yields quantitative trade-offs between dimension, context length, and temperature, and identifies regimes in which clustering can be mitigated.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Asia > Middle East > Jordan (0.04)
Dataset Distillation Efficiently Encodes Low-Dimensional Representations from Gradient-Based Learning of Non-Linear Tasks
Kinoshita, Yuri, Nishikawa, Naoki, Toyoizumi, Taro
Dataset distillation, a training-aware data compression technique, has recently attracted increasing attention as an effective tool for mitigating costs of optimization and data storage. However, progress remains largely empirical. Mechanisms underlying the extraction of task-relevant information from the training process and the efficient encoding of such information into synthetic data points remain elusive. In this paper, we theoretically analyze practical algorithms of dataset distillation applied to the gradient-based training of two-layer neural networks with width $L$. By focusing on a non-linear task structure called multi-index model, we prove that the low-dimensional structure of the problem is efficiently encoded into the resulting distilled data. This dataset reproduces a model with high generalization ability for a required memory complexity of $\tildeΘ$$(r^2d+L)$, where $d$ and $r$ are the input and intrinsic dimensions of the task. To the best of our knowledge, this is one of the first theoretical works that include a specific task structure, leverage its intrinsic dimensionality to quantify the compression rate and study dataset distillation implemented solely via gradient-based algorithms.
Hard labels sampled from sparse targets mislead rotation invariant algorithms
Ghosh, Avrajit, Yu, Bin, Warmuth, Manfred, Bartlett, Peter
One of the most common machine learning setups is logistic regression. In many classification models, including neural networks, the final prediction is obtained by applying a logistic link function to a linear score. In binary logistic regression, the feedback can be either soft labels, corresponding to the true conditional probability of the data (as in distillation), or sampled hard labels (taking values $\pm 1$). We point out a fundamental problem that arises even in a particularly favorable setting, where the goal is to learn a noise-free soft target of the form $σ(\mathbf{x}^{\top}\mathbf{w}^{\star})$. In the over-constrained case (i.e. the number of samples $n$ exceeds the input dimension $d$) with examples $(\mathbf{x}_i,σ(\mathbf{x}_i^{\top}\mathbf{w}^{\star}))$, it is sufficient to recover $\mathbf{w}^{\star}$ and hence achieve the Bayes risk. However, we prove that when the examples are labeled by hard labels $y_i$ sampled from the same conditional distribution $σ(\mathbf{x}_i^{\top}\mathbf{w}^{\star})$ and $\mathbf{w}^{\star}$ is $s$-sparse, then rotation-invariant algorithms are provably suboptimal: they incur an excess risk $Ω\!\left(\frac{d-1}{n}\right)$, while there are simple non-rotation invariant algorithms with excess risk $O(\frac{s\log d}{n})$. The simplest rotation invariant algorithm is gradient descent on the logistic loss (with early stopping). A simple non-rotation-invariant algorithm for sparse targets that achieves the above upper bounds uses gradient descent on the weights $u_i,v_i$, where now the linear weight $w_i$ is reparameterized as $u_iv_i$.
- Research Report > New Finding (0.89)
- Research Report > Experimental Study (0.75)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.55)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.34)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)
High-dimensional estimation with missing data: Statistical and computational limits
Verchand, Kabir Aladin, Pensia, Ankit, Haque, Saminul, Kuditipudi, Rohith
We consider computationally-efficient estimation of population parameters when observations are subject to missing data. In particular, we consider estimation under the realizable contamination model of missing data in which an $ε$ fraction of the observations are subject to an arbitrary (and unknown) missing not at random (MNAR) mechanism. When the true data is Gaussian, we provide evidence towards statistical-computational gaps in several problems. For mean estimation in $\ell_2$ norm, we show that in order to obtain error at most $ρ$, for any constant contamination $ε\in (0, 1)$, (roughly) $n \gtrsim d e^{1/ρ^2}$ samples are necessary and that there is a computationally-inefficient algorithm which achieves this error. On the other hand, we show that any computationally-efficient method within certain popular families of algorithms requires a much larger sample complexity of (roughly) $n \gtrsim d^{1/ρ^2}$ and that there exists a polynomial time algorithm based on sum-of-squares which (nearly) achieves this lower bound. For covariance estimation in relative operator norm, we show that a parallel development holds. Finally, we turn to linear regression with missing observations and show that such a gap does not persist. Indeed, in this setting we show that minimizing a simple, strongly convex empirical risk nearly achieves the information-theoretic lower bound in polynomial time.
- North America > United States > California (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Workflow (0.92)
- Research Report > New Finding (0.45)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia > Middle East > Israel (0.04)
SupplementaryMaterials
We first prove the direction Z T SI(Z;T) = 0, which is equivalent to prove I(Z;T) = 0 SI(Z;T) = 0. We prove the contrapositive, i.e. rather than show LHS = RHS, we show that RHS = LHS. Now assume that supwi,vj ρ(w i Z i,v j T j) > ϵ for some i,j. Then by setting those elements in w,v unrelated to Z i,T j to zero, and those related to Z i,T j exactlythesameaswi,vj,weknowthatsupw,vρ(w Z,v T) > ϵ. All neural networks are trained by Adam with its default settings and a learning rate η = 0.001. Early stopping is an useful technique for avoiding overfitting, however it needs to be carefully considered when applied to adversarial methods.
- Europe > France (0.05)
- North America > United States > Virginia > Albemarle County > Charlottesville (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia > Middle East > Israel (0.04)